-
Notifications
You must be signed in to change notification settings - Fork 43
feat(rust/sedona-geoparquet): Support WKB validation in read_parquet()
#578
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| pub object_store: Arc<dyn ObjectStore>, | ||
| pub metadata_size_hint: Option<usize>, | ||
| pub predicate: Arc<dyn PhysicalExpr>, | ||
| pub predicate: Option<Arc<dyn PhysicalExpr>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
refactor 1: before it's required since GeoParquetFileOpener is only used when there is geo column, and also there is valid spatial predicate, otherwise it fall back to inner parquet opener.
Now it also have to be used if validation is enabled, and there is no spatial predicate, so making it optional.
| pub enable_pruning: bool, | ||
| pub metrics: GeoParquetFileOpenerMetrics, | ||
| pub overrides: Option<HashMap<String, GeoParquetColumnMetadata>>, | ||
| pub options: TableGeoParquetOptions, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
refactor 2: before overrides lives inside TableGeoParquetOptions, the new validate flag also lives inside that option, so I think only keeping the option here can make it simpler.
| metadata_size_hint: Option<usize>, | ||
| predicate: Option<Arc<dyn PhysicalExpr>>, | ||
| overrides: Option<HashMap<String, GeoParquetColumnMetadata>>, | ||
| options: TableGeoParquetOptions, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
refactor 3: similar to another refactor in file_opener.rs
paleolimbot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few optional things to consider...thank you!
Just curious...have you run into files where this is a problem? I imagine if it does happen the existing error would be rather cryptic (although I believe we're good about validating WKB payload in every operator).
| def test_read_parquet_validate_wkb_single_valid_row(con, tmp_path): | ||
| valid_wkb = bytes.fromhex("0101000000000000000000F03F0000000000000040") | ||
|
|
||
| table = pa.table({"id": [1], "geom": [valid_wkb]}) | ||
| path = tmp_path / "single_valid_wkb.parquet" | ||
| pq.write_table(table, path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional, but there is a io/test_parquet.py that is possibly a better fit for these tests (and the geometry_columns tests) since there are now a lot of them!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated in 99d52a2
| DataType::Binary => { | ||
| let array = as_binary_array(array)?; | ||
| for (row_index, maybe_wkb) in array.iter().enumerate() { | ||
| if let Some(wkb_bytes) = maybe_wkb { | ||
| if let Err(e) = wkb::reader::read_wkb(wkb_bytes) { | ||
| return exec_err!( | ||
| "WKB validation failed for column '{}' at row {}: {}", | ||
| column_name, | ||
| row_index, | ||
| e | ||
| ); | ||
| } | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While we're here, it would be good to handle the LargeBinary case (this can happen for reasons outside a GeoParquet producer's control due to various auto-store/load of schemas)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in 99d52a2
Co-authored-by: Dewey Dunnington <dewey@dunnington.ca>
Thank you for the review! I have not encountered corrupted file so far, but my concern is the same: if this error is caught inside operators, the error might get propagated several times and produce confusing error messages. |
paleolimbot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
Follow up to #560
Rationale
read_parquet()now supports an override option,geometry_columns, to cast Binary columns to geometries. This is an unsafe operation, so adding a validation step is helpful.This PR currently limits validation to WKB validity. In the long term, additional metadata properties can also be validated. If any entry fails validation, an error will be returned.
Geo columns are inferred from {(Geo)Parquet metadata, user-provided
geometry_columnsoverride options}. Thevalidateoption applies uniformly in all cases—ifvalidate = true, geometry columns are always validated. I think this approach is more general and simpler.Implementation
Propagate the
validateoption toGeoParquetFileOpener(which yields the final decoded batches as the data source output), and use it to optionally validate the decoded batches for WKB validity.